{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Evaluating a Model](11.01-Evaluating-a-Model.ipynb) | [Contents](../README.md) | [Tuning Hyperparameters with Grid Search](11.03-Tuning-Hyperparameters-with-Grid-Search.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding Cross-Validation\n",
    "\n",
    "Cross-validation is a method of evaluating the generalization performance of a model that\n",
    "is generally more stable and thorough than splitting the dataset into training and test sets.\n",
    "\n",
    "The most commonly used version of cross-validation is $k$-fold cross-validation, where $k$ is a\n",
    "number specified by the user (usually five or ten). Here, the dataset is partitioned into k\n",
    "parts of more or less equal size, called folds. For a dataset that contains $N$ data points, each\n",
    "fold should thus have approximately $N / k$ samples. Then a series of models is trained on\n",
    "the data, using $k - 1$ folds for training and one remaining fold for testing. The procedure is\n",
    "repeated for $k$ iterations, each time choosing a different fold for testing, until every fold has\n",
    "served as a test set once.\n",
    "\n",
    "Refer to the book for an illustration of $k$-fold cross-validation for different values of $k$. Do you know what makes cross-validation different from just splitting the data into training and test sets?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Manually implementing cross-validation in OpenCV\n",
    "\n",
    "The easiest way to perform cross-validation in OpenCV is to do the data splits by hand.\n",
    "For example, in order to implement two-fold cross-validation, we would follow the\n",
    "following procedure.\n",
    "\n",
    "Load the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "import numpy as np\n",
    "iris = load_iris()\n",
    "X = iris.data.astype(np.float32)\n",
    "y = iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split the data into two equally sized parts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_fold1, X_fold2, y_fold1, y_fold2 = train_test_split(\n",
    "    X, y, random_state=37, train_size=0.5\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instantiate the classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import cv2\n",
    "knn = cv2.ml.KNearest_create()\n",
    "knn.setDefaultK(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train the classifier on the first fold, then predict the labels of the second fold:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "knn.train(X_fold1, cv2.ml.ROW_SAMPLE, y_fold1)\n",
    "_, y_hat_fold2 = knn.predict(X_fold2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train the classifier on the second fold, then predict the labels of the first fold:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "knn.train(X_fold2, cv2.ml.ROW_SAMPLE, y_fold2)\n",
    "_, y_hat_fold1 = knn.predict(X_fold1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute accuracy scores for both folds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.92000000000000004"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "accuracy_score(y_fold1, y_hat_fold1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.88"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(y_fold2, y_hat_fold2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This procedure will yield two accuracy scores, one for the first fold (92% accuracy), and one\n",
    "for the second fold (88% accuracy). On average, our classifier thus achieved 90% accuracy\n",
    "on unseen data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Automating cross-validation using scikit-learn\n",
    "\n",
    "Instantiate the classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "model = KNeighborsClassifier(n_neighbors=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perform cross-validation with the cross_val_score function. This function\n",
    "takes as input a model, the full dataset (`X`), the target labels (`y`) and an integer\n",
    "value for the number of folds (`cv`). It is not necessary to split the data by\n",
    "hand—the function will do that automatically depending on the number of folds.\n",
    "After the cross-validation is completed, the function returns the test scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.96666667,  0.96666667,  0.93333333,  0.93333333,  1.        ])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "scores = cross_val_score(model, X, y, cv=5)\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to get a sense how the model did on average, we can look at the mean and\n",
    "standard deviation of the five scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.95999999999999996, 0.024944382578492935)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scores.mean(), scores.std()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With five folds, we have a much better idea about how robust the classifier is on average.\n",
    "We see that $k$-NN with $k=1$ achieves on average 96% accuracy, and this value fluctuates\n",
    "from run to run with a standard deviation of roughly 2.5%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing leave-one-out cross-validation\n",
    "\n",
    "Another popular way to implement cross-validation is to choose the number of folds equal\n",
    "to the number of data points in the dataset. In other words, if there are $N$ data points, we set\n",
    "$k=N$. This means that we will end up having to do $N$ iterations of cross-validation, but in\n",
    "every iteration, the training set will consist of only a single data point. The advantage of this\n",
    "procedure is that we get to use all-but-one data point for training. Hence, this procedure is\n",
    "also known as leave-one-out cross-validation.\n",
    "\n",
    "In scikit-learn, this functionality is provided by the `LeaveOneOut` method from the\n",
    "`model_selection` module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import LeaveOneOut"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This object can be passed directly to the `cross_val_score` function in the following way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cross_val_score(model, X, y, cv=LeaveOneOut())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because every test set now contains a single data point, we would expect the scorer to\n",
    "return 150 values—one for each data point in the dataset. Each of these points we could get\n",
    "either right or wrong. Thus, we expect `scores` to be a list of ones (1) and zeros (0), which\n",
    "corresponds to correct and incorrect classifications, respectively:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,\n",
       "        1.,  1.,  1.,  1.,  1.,  1.,  1.])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to know the average performance of the classifier, we would still compute the\n",
    "mean and standard deviation of the scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.95999999999999996, 0.19595917942265423)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scores.mean(), scores.std()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see this scoring scheme returns very similar results to five-fold cross-validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Estimating robustness using bootstrapping\n",
    "\n",
    "An alternative procedure to $k$-fold cross-validation is **bootstrapping**.\n",
    "\n",
    "Instead of splitting the data into folds, bootstrapping builds a training set by drawing\n",
    "samples randomly from the dataset. Typically, a bootstrap is formed by drawing samples\n",
    "with replacement. Imagine putting all of the data points into a bag and then drawing\n",
    "randomly from the bag. After drawing a sample, we would put it back in the bag. This\n",
    "allows for some samples to show up multiple times in the training set, which is something\n",
    "cross-validation does not allow.\n",
    "\n",
    "The classifier is then tested on all samples that are not part of the bootstrap (the so-called\n",
    "**out-of-bag** examples), and the procedure is repeated a large number of times (say, 10,000\n",
    "times). Thus, we get a distribution of the model's score that allows us to estimate the\n",
    "robustness of the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bootstrapping can be implemented with the following procedure.\n",
    "\n",
    "Instantiate the classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "knn = cv2.ml.KNearest_create()\n",
    "knn.setDefaultK(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From our dataset with $N$ samples, randomly choose $N$ samples with replacement\n",
    "to form a bootstrap. This can be done most easily with the choice function from\n",
    "NumPy's random module. We tell the function to draw len(`X`) samples in the\n",
    "range `[0, len(X)-1]` with replacement (`replace=True`). The function then\n",
    "returns a list of indices, from which we form our bootstrap:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "idx_boot = np.random.choice(len(X), size=len(X), replace=True)\n",
    "X_boot = X[idx_boot, :]\n",
    "y_boot = y[idx_boot]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Put all samples that do not show in the bootstrap in the out-of-bag set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "idx_oob = np.array([x not in idx_boot\n",
    "                    for x in np.arange(len(X))], dtype=np.bool)\n",
    "X_oob = X[idx_oob, :]\n",
    "y_oob = y[idx_oob]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train the classifier on the bootstrap samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test the classifier on the out-of-bag samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.92727272727272725"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "_, y_hat = knn.predict(X_oob)\n",
    "accuracy_score(y_oob, y_hat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we want to repeat these steps up to 10,000 times to get 10,000\n",
    "accuracy scores, then average the scores to get an idea of the classifier's mean\n",
    "performance.\n",
    "\n",
    "For our convenience, we can build a function so that it is easy to run the\n",
    "procedure for some `n_iter` number of times. We also pass a model (our $k$-NN classifier,\n",
    "`model`), the feature matrix (`X`), and the vector with all class labels (`y`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def yield_bootstrap(model, X, y, n_iter=10000):\n",
    "    for _ in range(n_iter):\n",
    "        # train the classifier on bootstrap\n",
    "        idx_boot = np.random.choice(len(X), size=len(X),\n",
    "                                    replace=True)\n",
    "        X_boot = X[idx_boot, :]\n",
    "        y_boot = y[idx_boot]\n",
    "        model.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)\n",
    "        \n",
    "        # test classifier on out-of-bag examples\n",
    "        idx_oob = np.array([x not in idx_boot\n",
    "                            for x in np.arange(len(X))],\n",
    "                           dtype=np.bool)\n",
    "        X_oob = X[idx_oob, :]\n",
    "        y_oob = y[idx_oob]\n",
    "        _, y_hat = model.predict(X_oob)\n",
    "        \n",
    "        # return accuracy\n",
    "        yield accuracy_score(y_oob, y_hat)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make sure we all get the same result, let's fix the seed of the random number generator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's run the procedure for `n_iter=10` times by converting the function output to a\n",
    "list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.98333333333333328,\n",
       " 0.93650793650793651,\n",
       " 0.92452830188679247,\n",
       " 0.92307692307692313,\n",
       " 0.94545454545454544,\n",
       " 0.94736842105263153,\n",
       " 0.98148148148148151,\n",
       " 0.96078431372549022,\n",
       " 0.93220338983050843,\n",
       " 0.96610169491525422]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(yield_bootstrap(knn, X, y, n_iter=10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, for this small sample we get accuracy scores anywhere between 92% and\n",
    "98%. To get a more reliable estimate of the model's performance, we repeat the procedure\n",
    "1,000 times and calculate both mean and standard deviation of the resulting scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.95524155136419198, 0.022040380995646654)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "acc = list(yield_bootstrap(knn, X, y, n_iter=1000))\n",
    "np.mean(acc), np.std(acc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You are always welcome to increase the number of repetitions. But once `n_iter` is large\n",
    "enough, the procedure should be robust to the randomness of the sampling procedure. In\n",
    "this case, we do not expect to see any more changes to the distribution of score values as we\n",
    "keep increasing `n_iter` to, for example, 10,000 iterations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.95501528733009422, 0.021778543317079499)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "acc = list(yield_bootstrap(knn, X, y, n_iter=10000))\n",
    "np.mean(acc), np.std(acc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Typically, the scores obtained with bootstrapping would be used in a **statistical test** to\n",
    "assess the **significance** of our result. Let's have a look at how that is done."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Implementing Student's t-test\n",
    "\n",
    "One of the most famous statistical tests is **Student's $t$-test**. You might have heard of it\n",
    "before: it allows us to determine whether two sets of data are significantly different from\n",
    "one another. This was a really important test for William Sealy Gosset, the inventor of the\n",
    "test, who worked at the Guinness brewery and wanted to know whether two batches of\n",
    "stout differed in quality.\n",
    "\n",
    "In practice, the $t$-test allows us to determine whether two data samples come from\n",
    "underlying distributions with the same mean or **expected value**.\n",
    "\n",
    "For our purposes, this means that we can use the $t$-test to determine whether the test scores\n",
    "of two independent classifiers have the same mean value. We start by hypothesizing that\n",
    "the two sets of test scores are identical. We call this the **null hypothesis** because this is the\n",
    "hypothesis we want to nullify, that is, we are looking for evidence to **reject** the hypothesis\n",
    "because we want to ensure that one classifier is significantly better than the other.\n",
    "\n",
    "We accept or reject a null hypothesis based on a parameter known as the $p$-value that the $t$-test\n",
    "returns. The $p$-value takes on values between 0 and 1. A $p$-value of 0.05 would mean\n",
    "that the null hypothesis is right only 5 out of 100 times. A small $p$-value thus indicates\n",
    "strong evidence that the hypothesis can be safely rejected. It is customary to use $p=0.05$ as a\n",
    "cut-off value below which we reject the null hypothesis.\n",
    "\n",
    "If this is all too confusing, think of it this way: when we run a $t$-test for the purpose of\n",
    "comparing classifier test scores, we are looking to obtain a small $p$-value because that means\n",
    "that the two classifiers give significantly different results.\n",
    "\n",
    "We can implement Student's $t$-test with SciPy's `ttest_ind` function from the `stats`\n",
    "module:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from scipy.stats import ttest_ind"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start with a simple example. Assume we ran five-fold cross-validation on two\n",
    "classifiers and obtained the following scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_a = [1, 1, 1, 1, 1]\n",
    "scores_b = [0, 0, 0, 0, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This means that Model A achieved 100% accuracy in all five folds, whereas Model B got 0%\n",
    "accuracy. In this case, it is clear that the two results are significantly different. If we run the\n",
    "$t$-test on this data, we should thus find a really small $p$-value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=inf, pvalue=0.0)"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ttest_ind(scores_a, scores_b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we do! We actually get the smallest possible $p$-value, $p=0.0$.\n",
    "\n",
    "On the other hand, what if the two classifiers got exactly the same numbers, except during\n",
    "different folds. In this case, we would expect the two classifiers to be equivalent, which is\n",
    "indicated by a really large $p$-value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=0.0, pvalue=1.0)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scores_a = [0.9, 0.9, 0.9, 0.8, 0.8]\n",
    "scores_b = [0.8, 0.8, 0.9, 0.9, 0.9]\n",
    "ttest_ind(scores_a, scores_b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analogous to the aforementioned, we get the largest possible $p$-value, $p=1.0$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what happens in a more realistic example, let's return to our $k$-NN classifier from\n",
    "earlier example. Using the test scores obtained from the ten-fold cross-validation procedure,\n",
    "we can compare two different $k$-NN classifiers with the following procedure.\n",
    "\n",
    "Obtain a set of test scores for Model A. We choose Model A to be the $k$-NN\n",
    "classifier from earlier ($k=1$):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.95999999999999996, 0.053333333333333323)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k1 = KNeighborsClassifier(n_neighbors=1)\n",
    "scores_k1 = cross_val_score(k1, X, y, cv=10)\n",
    "np.mean(scores_k1), np.std(scores_k1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain a set of test scores for Model B. Let's choose Model B to be a $k$-NN\n",
    "classifier with $k=3$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.96666666666666656, 0.044721359549995787)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k3 = KNeighborsClassifier(n_neighbors=3)\n",
    "scores_k3 = cross_val_score(k3, X, y, cv=10)\n",
    "np.mean(scores_k3), np.std(scores_k3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply the $t$-test to both sets of scores:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=-0.2873478855663425, pvalue=0.77712784875052965)"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ttest_ind(scores_k1, scores_k3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this is a good example of two classifiers giving different cross-validation\n",
    "scores (96.0% and 96.7%) that turn out to be not significantly different! Because we get a\n",
    "large $p$-value ($p=0.777$), we expect the two classifiers to be equivalent 77 out of 100 times."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing McNemar's test\n",
    "\n",
    "A more advanced statistical technique is McNemar's test. This test can be used on paired\n",
    "data to determine whether there are any differences between the two samples. As in the\n",
    "case of the $t$-test, we can use McNemar's test to determine whether two models give\n",
    "significantly different classification results.\n",
    "\n",
    "McNemar's test operates on pairs of data points. This means that we need to know, for both\n",
    "classifiers, how they classified each data point. Based on the number of data points that the\n",
    "first classifier got right but the second got wrong and vice versa, we can determine whether\n",
    "the two classifiers are equivalent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from scipy.stats import binom\n",
    "def mcnemar_midp(b, c):\n",
    "    \"\"\"\n",
    "    Compute McNemar's test using the \"mid-p\" variant suggested by:\n",
    "    \n",
    "    M.W. Fagerland, S. Lydersen, P. Laake. 2013. The McNemar test for \n",
    "    binary matched-pairs data: Mid-p and asymptotic are better than exact \n",
    "    conditional. BMC Medical Research Methodology 13: 91.\n",
    "    \n",
    "    `b` is the number of observations correctly labeled by the first---but \n",
    "    not the second---system; `c` is the number of observations correctly \n",
    "    labeled by the second---but not the first---system.\n",
    "    \"\"\"\n",
    "    n = b + c\n",
    "    x = min(b, c)\n",
    "    dist = binom(n, .5)\n",
    "    p = 2. * dist.cdf(x)\n",
    "    midp = p - dist.pmf(x)\n",
    "    return midp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's assume the preceding Model A and Model B were applied to the same five data points.\n",
    "Whereas Model A classified every data point correctly (denoted with a 1), Model B got all of\n",
    "them wrong (denoted with a 0):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "scores_a = np.array([1, 1, 1, 1, 1])\n",
    "scores_b = np.array([0, 0, 0, 0, 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "McNemar's test wants to know two things:\n",
    "- How many data points did Model A get right but Model B get wrong?\n",
    "- How many data points did Model A get wrong but Model B get right?\n",
    "\n",
    "We can check which data points Model A got right but Model B got wrong as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 1, 1])"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a1_b0 = scores_a * (1 - scores_b)\n",
    "a1_b0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, this applies to all of the data points. The opposite is true for the data points that\n",
    "Model B got right and Model A got wrong:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 0, 0])"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a0_b1 = (1 - scores_a) * scores_b\n",
    "a0_b1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feeding these numbers to McNemar's test should return a small $p$-value because the two\n",
    "classifiers are obviously different:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.03125"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mcnemar_midp(a1_b0.sum(), a0_b1.sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And it does!\n",
    "\n",
    "We can apply McNemar's test to a more complicated example, but we cannot operate on\n",
    "cross-validation scores anymore. The reason is that we need to know the classification result\n",
    "for every data point, not just an average. Hence, it makes more sense to apply McNemar's\n",
    "test to the leave-one-out cross-validation.\n",
    "\n",
    "Going back to $k$-NN with $k=1$ and $k=3$, we can calculate their scores as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_k1 = cross_val_score(k1, X, y, cv=LeaveOneOut())\n",
    "scores_k3 = cross_val_score(k3, X, y, cv=LeaveOneOut())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of data points that one of the classifiers got right but the other got wrong are as\n",
    "follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum(scores_k1 * (1 - scores_k3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum((1 - scores_k3) * scores_k3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We got no differences whatsoever! Now it becomes clear why the $t$-test led us to believe\n",
    "that the two classifiers are identical. As a result, if we feed the two sums into McNemar's\n",
    "test function, we get the largest possible $p$-value, $p=1.0$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mcnemar_midp(np.sum(scores_k1 * (1 - scores_k3)),\n",
    "             np.sum((1 - scores_k1) * scores_k3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Evaluating a Model](11.01-Evaluating-a-Model.ipynb) | [Contents](../README.md) | [Tuning Hyperparameters with Grid Search](11.03-Tuning-Hyperparameters-with-Grid-Search.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
